# Load Necessary Packages
library(tidyverse)
library(tidymodels)
library(naniar)
library(skimr)
library(cowplot)
# Set Seed
set.seed(3206)The arts, from plays to musicals to live jazz, give us hope, meaning, connection and community. Particularly local arts brings us closer to the communities we engage with on a daily basis. Further, by better understanding who attends arts performances and exhibits and how often they do, as well as why other people don’t, we can, hopefully, make the arts more accessible to everyone. Thus, for our predictive modeling project, we will be using a dataset we from the Local Area Arts Participation study from 1992. This study was funded by the National Endowment for the Arts (Research Division) and conducted by Abt Associates of Cambridge, MA. It contains information on whether and how often Americans attend and participate in various kinds of local art, including jazz concerts, ballet performances, etc. It also contains various demographic characteristics of the respondents. In collecting the data, Abt Associates randomly selected Americans from 12 localities via a random-digit-dialing sampling methodology and contacted them by phone for this survey.
In total, there are 5,040 total observations and 95 attributes. 72 of the characteristics are factored data while the other 23 are numeric data. The categorical characteristics generally involve whether the respondent attended or participated in local art, as well as demographic data. The numerical features include the number of times the respondent attended that specific event as well as demographic data. The data is read in below, and named local_arts_data.
# Load the Data
load(file = "data/processed/local_arts_data.Rda")There appears to be significant missingness within some characteristics of the data. There are three reasons that this missingness occurs:
The majority of NA values are due to the second reason—that the question did not apply to the respondent. For instance, the columns with the most missing values, bar5 and bar4, ask about the fourth and fifth reasons for why the respondent did not attend cultural and artistic events more often. However, the previous three questions, bar1, bar2, and bar3, ask the same question, and it’s likely that most respondents didn’t have more than three reasons for not attending more cultural/artistic events. Another example is mags, which asks what magazines the respondent read to learn about art events in the community, which would be dependent on whether the respondent read magazines and cited them as a source for learning about art events. For those predictors with missing values that do not fall under that category, it is more likely that the respondent did not know the answer to the question than that they refused to answer the question or that the question didn’t apply to them.
Below is a table of the missingness in the local arts data.
# Create function to select columns with missing values
has_na <- function(x){
any(is.na(x))
}
# Graph missingness
local_arts_data %>%
# select only columns that have missing values
select_if(has_na) %>%
# summary of missing variables
miss_var_summary() %>%
print(n = 20)## # A tibble: 80 x 3
## variable n_miss pct_miss
## <chr> <int> <dbl>
## 1 bar5 5029 99.8
## 2 bar4 4946 98.1
## 3 mags 4913 97.5
## 4 more8 4823 95.7
## 5 bar3 4612 91.5
## 6 more7 4385 87.0
## 7 radio 3943 78.2
## 8 more6 3738 74.2
## 9 tvtype 3642 72.3
## 10 bar2 3590 71.2
## 11 more5 2969 58.9
## 12 mostimp 2370 47.0
## 13 more4 2229 44.2
## 14 more3 1570 31.2
## 15 bar1 1483 29.4
## 16 newsp 1432 28.4
## 17 moremost 1124 22.3
## 18 more2 1009 20.0
## 19 over18 940 18.7
## 20 income 915 18.2
## # … with 60 more rows
The truncated table above shows that there are 77 variables with missingness in our data set. In the table, variables from bar5 down to newsp have over 20% of their values as NA. For the variables with more than 20% missingness, it would probably make sense to remove most of them, as more than 20% of observations are a lot to impute. We decided to remove 14 predictors with a significant amount of missingness (bar5, bar4, mags, more8, bar3, more7, radio, more6, tvtype, bar2, more5, more4, more3, newsp). The code for that is below.
local_arts_data <- local_arts_data %>%
select(-bar5, -bar4, -mags, -more8, -bar3, -more7, -radio, -more6,
-tvtype, -bar2, -more5, -more4, -more3, -newsp)We decided to keep one variable that had greater than 20% missingness, specifically mostimp, as that question relates to the most important reason the participant did not attend arts events more often.
For the rest of the variables with missingness at or below 20%, an imputation step will be used to replace those missing values.
In our predictive modeling project, we will attempt to predict the survey respondent’s household income, the variable income, using the other variables in the data-set related to arts attendence and demographic information as predictors. This is a categorical variable with 9 levels indicating the range of the respondent’s household income. Before conducting an initial split on the data into a training and testing set, we wanted to visualize our outcome variable.
# Distribution of income (graph)
local_arts_data %>%
ggplot(aes(x = income)) +
geom_bar() +
coord_flip() +
labs(
title = "Distribution of Respondent's Household Income"
) +
theme_minimal()# Table of the levels of income
local_arts_data %>%
count(income)## # A tibble: 10 x 2
## income n
## <fct> <int>
## 1 Under $10,000 315
## 2 $10,000 to $14,999 288
## 3 $15,000 to $19,999 375
## 4 $20,000 to $29,999 721
## 5 $30,000 to $39,999 679
## 6 $40,000 to $49,999 540
## 7 $50,000 to $74,000 698
## 8 $75,000 to $99,999 243
## 9 $100,000 or more 266
## 10 <NA> 915
As the graph and table above show, there is a significant amount of missingness in the outcome variable, with a total of 915 missing values. Most of this missingness is because participants didn’t know their household income or because the respondent refused to answer the question, with some NA values being truly missing responses. For the rest of the data, the categories with the highest counts are $50,000 to $74,999, $30,000 to $39,999, and $20,000 to $29,999. Categories with the lowest counts are $75,000 to $99,000 and $100,000 or more. Before splitting the data into a training and test set, it would make sense to get rid of the rows that contain missing values, since it would not make sense (nor would it be possible) to predict missing values. The code for that is below.
# Only keep observations with values for income
local_arts_data <- local_arts_data %>%
filter(!is.na(income))We decided to initially split the data into 70% training, and 30% testing. The imbalance between the various classes of income shown above prove that it would be important to stratify our initial split (and later our cross-validation) by income. Below is the code for splitting the data.
# Conduct Initial Split
local_arts_split <- initial_split(data = local_arts_data, prop = 0.7, strata = income)
# Collect Training Data
local_arts_train <- training(local_arts_split)After performing some initial cleaning and processing of the dataset, there are 74 variables remaining, 73 of which may serve as predictor variables. There are two sets of variables indicating participation in the arts: one set that is categorical, indicating whether or not the participant has gone to the indicated arts event in the last 12 months. The other set is numeric, indicating how many times the participant has gone to the indicated arts event in the last 12 months. Using both sets of variables might be a bit redundant, so we may focus on the numeric set rather than the categorical set, as that would give us a more nuanced look at not only whether or not the participant has been to certain arts events but the frequency with which they have attended those events. Their participation in the arts most likely correlates with their income bracket, as high participation in the arts may signal more available time for leisure activities.
Taking a look at both the relationship between both the categorical and numeric sets of predictor variables regarding participation in the arts and our outcome variable income shows that there is definitely a trend.
# combined graph of categorical arts participation variables
local_arts_train %>%
select(income, jazz, classic, opera, musical, play, ballet, museum, fair) %>%
gather(-income, key = "var", value = "value") %>%
ggplot(aes(x = income, fill = value)) +
geom_bar(position = "fill") +
coord_flip() +
facet_wrap(~ var, scales = "free") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45)) +
labs(
title = "Attendance at Arts Activities by Income",
fill = "Have you\nattended\n(*) in the last\n12 months?",
x = "prop"
)As you can see above, for every kind of local art activity, from live jazz, to a museum, to a staged play, the proportion of people who have attended in the last 12 months increases as the income brackets increases. For almost every local art activity, the income bracket with the most people who attended in the last 12 months is the $100,000 or more income bracket. This makes sense: those with more money have the financial ability and time to attend local arts events. The same pattern seems to hold when we look at the numeric representation of arts attendance and participation.
# combined graph of numeric arts participation variables
local_arts_train %>%
select(income, starts_with("n")) %>%
filter(!is.na(income)) %>%
select(., 1:10) %>%
gather(-income, key = "var", value = "value") %>%
ggplot(aes(x = value, y = income)) +
geom_boxplot() +
facet_wrap(~ var, scales = "free") +
theme_minimal() +
coord_cartesian(xlim = c(0, 10)) +
theme(axis.text.x = element_text(angle = 45)) +
labs(
title = "Frequency of Attendance at Arts Activities by Income",
x = "number of times attending"
)It’s a little hard to see a relationship for a lot of the variables since so many people went zero times to a local arts performance. But for some variables, there does seem to be a positive relationship between the frequency of arts attendance and income. For example, people with an income of 100,000 dollars or more and those with an income between 75,000 and 99,999 visited museums in the last twelve months a median of 2 times, while those with lower household incomes visited museums on the whole less frequently. In addition, people with a household income of 20,000 dollars or more had a higher median number of visits to art fairs (nfair) than those with an income under 20,000 dollars.
Below is another graph of income in relation to more variables indicating the number of times the respondent participated in a specific activity in the last 12 months. nbooks refers to the number of books the respondent read; npark refers to the number of times the respondent visited a history park/monument; ntvart refers to the number of times the respondent watched a visual arts program on TV/VCR; ntvclass refers to the number of times the respondent watched a classical music program on TV/VCR; ntvdance refers to the number of times a respondent watched a dance program on TV; ntvjazz refers to the number of times a respondent watched a jazz program on TV; ntvmus refers to the number of times a respondent watched a musical on TV; ntvopera refers to the number of times a respondent watched opera on TV, and ntvplay refers to the number of times a respondent watched a play on TV.
# combined graph of numeric arts participation variables 2
local_arts_train %>%
select(income, starts_with("n")) %>%
filter(!is.na(income)) %>%
select(., 1, 11:19) %>%
gather(-income, key = "var", value = "value") %>%
ggplot(aes(x = value, y = income)) +
geom_boxplot() +
facet_wrap(~ var, scales = "free") +
theme_minimal() +
coord_cartesian(xlim = c(0, 20)) +
theme(axis.text.x = element_text(angle = 45)) +
labs(
title = "Frequency of Engaging in/Watching Arts Activities by Income",
x = "number of times engaging"
)Again, since the majority of people watched or engaged in most of the above arts activities zero times in the last twelve months, there doesn’t appear to be much of a relationship between most of the variables and income, except for nbooks and npark. There is a very clear relationship between the income of a respondent and the number of books they read in the last twelve months, with those with the highest income reading the most and those with the lowest income reading the least. There also seems to be a slight relationship between the income of a respondent and the number of times they visited a historic park or monument, with those with the highest income visiting historic sites slightly more frequently then those with the lowest income.
Another predictor variable related to arts participation/engagement that we think would be interesting to investigate in relation to income is whether or not a person visited the movies in the past twelve months.
local_arts_data %>%
ggplot(aes(income, fill = cinema)) +
geom_bar(position = "fill") +
coord_flip() +
theme_minimal() +
labs(title = "Going to the movies by income",
fill = "Have you gone to movie theater\nto see a movie\nin last 12 months?",
y = "prop")The graph pretty clearly shows that more people with higher incomes went to see a movie in the theaters in the last twelve months than people with lower incomes.
While the above variables provide broad trends as to the frequency of the participants’ attendance to their local arts events, it may be just as important as to understand why they have and why they haven’t.
For instance, we can take a look at the variable bar1, which indicates respondent’s first reason for not attending local arts events/not attending them more often.
# distribution of bar1
local_arts_train %>%
mutate(
bar1 = bar1 %>% fct_infreq() %>% fct_rev()
) %>%
ggplot(aes(bar1)) +
geom_bar() +
coord_flip() +
labs(
x = "reason for not attending"
) +
theme_minimal()As you can see above, the most common reason for people not attending local arts events more frequently is that they don’t have time, followed by the cost of tickets and lack of childcare. It could also be interesting to see how these reasons for not attending local arts events change (or remain the same) across levels of income.
# distribution of bar1 by income
local_arts_train %>%
mutate(
bar1 = bar1 %>% fct_infreq() %>% fct_rev()
) %>%
ggplot(aes(bar1)) +
geom_bar() +
facet_wrap(~ income) +
coord_flip() +
labs(
x = "reason for not attending"
) +
theme_minimal()It seems that across all income levels, lack of time is the primary reason for not attending local arts events more frequently.
Other variables that ask about the reason behind their lack of attendance include:
mostimp: most important reason participant did not attend arts events more often
howimp: how important it is for the participant to attend events
Another important factor to bear in mind when observing and predicting survey respondent’s household income is their demographic information. These variables include:
education level
gender
age
household size
marital status
Each of these seem as if they could provide valuable insight into the income of the participant. For example, income generally increases with education and age, while a larger household size may indicate higher income with more financial stability.
Below is a plot of the highest education level by income.
local_arts_train %>%
select(income, educ, race, gender, age, hhsize) %>%
filter(!is.na(income)) %>%
ggplot(aes(income, fill = educ)) +
geom_bar(position = "fill") +
coord_flip() +
theme_minimal() +
labs(
title = "Distribution of Education Levels across Income Brackets",
fill = "Education Level",
y = "prop"
)As hypothesized, there appears to be a correlation between education level and income. The proportion of people with bachelor’s degrees and with grad degrees increases as income increases.
The next demographic factor is gender. (Race was included as a secondary finding, rather than a primary finding.)
local_arts_train %>%
select(income, educ, gender, age, hhsize) %>%
filter(!is.na(income)) %>%
ggplot(aes(income, fill = gender)) +
geom_bar(position = "fill") +
coord_flip() +
theme_minimal() +
labs(
title = "Distribution of Gender across Income Brackets",
fill = "Gender",
y = "prop"
)Across nearly all the income brackets, the proportion of females outnumber the proportion of males, except in the highest income bracket.
local_arts_train %>%
select(income, educ, gender, age, hhsize) %>%
filter(!is.na(income)) %>%
ggplot(aes(age)) +
geom_bar() +
coord_flip() +
facet_wrap(~ income) +
theme_minimal() +
labs(
title = "Distribution of Age across Income Brackets",
x = "Age",
y = "Count"
)The distribution of ages is unimodal in all income brackets, with many of the graphs having a left tail. As the income bracket increases, the plot appears to move “upwards.” In other words, the age increases as the income increases, indicating a positive relationship between the two. This would make sense: younger adults tend to have less money due to their lack of experience. Older adults have more experience in the workforce and may be more financially responsible, leading to a higher income.
local_arts_train %>%
select(income, educ, gender, age, hhsize) %>%
filter(!is.na(income)) %>%
ggplot(aes(hhsize)) +
geom_bar() +
coord_flip() +
facet_wrap(~ income, scales = "free") +
theme_minimal() +
labs(
title = "Distribution of Household Size across Income Brackets",
x = "Household Size",
y = "Count"
)The distribution of household sizes is heavily left skewed in all income brackets. As the income bracket increases, the household size increases as well. One noteworthy observation is that one point, specifically located in the income bracket of $30,000 to $39,999, appears to be an outlier, with a value of 73 for the household size. In future analysis, the household size for that point may be imputed, or the observation may be removed altogether as this is likely a data collection error.
local_arts_train %>%
select(income, marital) %>%
filter(!is.na(income)) %>%
ggplot(aes(income, fill = marital)) +
geom_bar(position = "fill") +
coord_flip() +
theme_minimal() +
labs(
title = "Distribution of Marital Status across Income Brackets",
y = "prop",
fill = "Marital Status"
)Clearly, as the the income increases, so do the proportion of married couples. This makes sense: married participants have their primary source of income as well as their partner’s source of income, leading to a higher household income.
The secondary findings mainly consist of variables that were explored in the process of this exploratory data analysis, but fail to present interesting or surprising results.
Using the caret package, we used nearZeroVar() to identify which variables had little to no variance. Such values would fail to provide meaningful insight or significantly affect the model.
local_arts_train %>%
select(caret::nearZeroVar(local_arts_train)) %>%
colnames()## [1] "abtid" "nopera" "source4" "source5" "exchange"
Thus, these variables should be left out of the dataset and recipe.
Many questions revolved around preferences of the participant. One such example is lisjazz, which explores whether the participant listened to jazz performances on radio, tapes, or compact discs. A graph of its distribution is below.
p1 <- local_arts_train %>%
select(starts_with("lis")) %>%
ggplot(aes(lisjazz)) +
geom_bar() +
coord_flip() +
theme_minimal() +
labs(x = "Count",
y = "Listened to Jazz?")
p2 <- local_arts_train %>%
select(starts_with("lis")) %>%
ggplot(aes(lisclass)) +
geom_bar() +
coord_flip() +
theme_minimal() +
labs(x = "Count",
y = "Listened to Classical?")
p3 <- local_arts_train %>%
select(starts_with("lis")) %>%
ggplot(aes(lisopera)) +
geom_bar() +
coord_flip() +
theme_minimal() +
labs(x = "Count",
y = "Listened to Opera?")
p4 <- local_arts_train %>%
select(starts_with("lis")) %>%
ggplot(aes(lismus)) +
geom_bar() +
coord_flip() +
theme_minimal() +
labs(x = "Count",
y = "Listened to Music?")
p5 <- local_arts_train %>%
select(starts_with("lis")) %>%
ggplot(aes(lisplay)) +
geom_bar() +
coord_flip() +
theme_minimal() +
labs(x = "Count",
y = "Listened to Play?")
plot_grid(p1, p2, p3, p4, p5)Since the mediums are widely accessible (radio, tapes, or compact discs), whether the participant listened to a type of music seems like it is more about the participant’s preference, rather than any meaningful information about their income level. This is also true of the questions that ask about what kind of events the participant would like to attend more of as well.
p1 <- local_arts_train %>%
select(starts_with("more")) %>%
ggplot(aes(more1)) +
geom_bar() +
coord_flip() +
theme_minimal() +
labs(x = "Count",
y = "I want to attend more...?")
p2 <- local_arts_train %>%
select(starts_with("more")) %>%
ggplot(aes(more2)) +
geom_bar() +
coord_flip() +
theme_minimal() +
labs(x = "Count",
y = "I want to attend more...?")
plot_grid(p1, p2)One demographic characteristic considered as a potential factor was race. Below is the plot relating race with income.
local_arts_train %>%
select(income, race) %>%
filter(!is.na(income)) %>%
ggplot(aes(race)) +
geom_bar() +
coord_flip() +
facet_wrap(~ income) +
theme_minimal() +
labs(
title = "Race by Income",
x = "Count",
y = "Race"
)There appears to be little diversity. Instead, the predominate count is White, not Hispanic, which far outnumbers the other races.